The ground truth about metadata and community detection in networks
نویسندگان
چکیده
Across many scientific domains, there is a common need to automatically extract a simplified view or coarse-graining of how a complex system's components interact. This general task is called community detection in networks and is analogous to searching for clusters in independent vector data. It is common to evaluate the performance of community detection algorithms by their ability to find so-called ground truth communities. This works well in synthetic networks with planted communities because these networks' links are formed explicitly based on those known communities. However, there are no planted communities in real-world networks. Instead, it is standard practice to treat some observed discrete-valued node attributes, or metadata, as ground truth. We show that metadata are not the same as ground truth and that treating them as such induces severe theoretical and practical problems. We prove that no algorithm can uniquely solve community detection, and we prove a general No Free Lunch theorem for community detection, which implies that there can be no algorithm that is optimal for all possible community detection tasks. However, community detection remains a powerful tool and node metadata still have value, so a careful exploration of their relationship with network structure can yield insights of genuine worth. We illustrate this point by introducing two statistical techniques that can quantify the relationship between metadata and community structure for a broad class of models. We demonstrate these techniques using both synthetic and real-world networks, and for multiple types of metadata and community structures.
منابع مشابه
Metadata vs. Ground-truth: A Myth behind the Evolution of Community Detection Methods
A community detection (CD) method is usually evaluated by what extent it is able to discover the ‘ground-truth’ community structure of a network. A certain ‘node-centric metadata’ is used to dene the ground-truth partition. However, nodes in real networks often have multiple metadata types (e.g., occupation, location); each can potentially form a ground-truth partition. Our experiment with 10 ...
متن کاملCommunity detection: effective evaluation on large social networks
While many recently proposed methods aim to detect network communities in large datasets, such as those generated by social media and telecommunications services, most evaluation (i.e., benchmarking) of this research is based on small, hand-curated datasets. We argue that these two types of networks differ so significantly that by evaluating algorithms solely on the smaller networks, we know li...
متن کاملCommunity detection in networks: Structural communities versus ground truth.
Algorithms to find communities in networks rely just on structural information and search for cohesive subsets of nodes. On the other hand, most scholars implicitly or explicitly assume that structural communities represent groups of nodes with similar (nontopological) properties or functions. This hypothesis could not be verified, so far, because of the lack of network datasets with informatio...
متن کاملHidden Community Detection in Social Networks
We introduce a new paradigm that is important for community detection in the realm of network analysis. Networks contain a set of strong, dominant communities, which interfere with the detection of weak, natural community structure. When most of the members of the weak communities also belong to stronger communities, they are extremely hard to be uncovered. We call the weak communities the hidd...
متن کاملEvaluating Overfit and Underfit in Models of Network Community Structure
A common data mining task on networks is community detection, which seeks an unsupervised decomposition of a network into structural groups based on statistical regularities in the network’s connectivity. Although many methods now exist, the recently proved No Free Lunch theorem for community detection implies that each makes some kind of tradeoff, and no algorithm can be optimal on all inputs....
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره 3 شماره
صفحات -
تاریخ انتشار 2017